Abstract. The ongoing exponential growth of available computing power has
radically transformed the methods of statistical inference. What used to be
the position of a small minority of statisticians, advocating the use of priors
and strict adherence to Bayes' theorem, is now becoming the norm across
disciplines. The evolutionary direction is now clear. The trend is towards more
realistic, flexible and complex likelihoods characterized by an ever increasing
number of parameters. This gives the old question, "What should the prior be?",
a new central importance in the modern Bayesian theory of inference. Entropic
priors provide one answer to the problem of prior selection. The general
definition of an entropic prior has existed since 1988, but it was not until
1998 that it was found that they provide a new notion of complete ignorance.
This paper re-introduces the family of entropic priors as minimizers of mutual
information between the data and the parameters, as in earlier work but with a
small change and a correction. The general formalism is then applied to two
large classes of models: discrete probabilistic networks and univariate finite
mixtures of gaussians. It is also shown how to perform inference by efficiently
sampling the corresponding posterior distributions.

Key words: Bayesian Belief Networks, Mixture Models, Entropic Priors, Markov 
Chain Monte Carlo, MCMC, Generalized Inverse Gaussian distribution, Gamma 
Approximation to GIG 

1. Introduction 

Entropic Priors minimize a type of mutual information between the data
and the parameters. Hence, Entropic Priors are the prior models that are most
ignorant about the data. As Jaynes used to say: they are maximally
noncommittal with respect to missing information. Entropic Priors (as opposed
to other prior assignments of probability) come with a guarantee: they include
only the information in the likelihood, the initial guess, the hyper-parameter
and the possible side conditions that are explicitly imposed, and nothing else.
Entropic Priors provide a general recipe for prior probabilities that allows
the enjoyment of the Bayesian omelet even in high dimensional parameter spaces.

This paper presents the explicit computation of Entropic Priors for two classes 
of models: General Discrete Probabilistic Networks (a.k.a. Belief Nets, Bayesian 
Nets, BBNs) and for Mixtures of Gaussians Models. These models constitute the 
core of the probabilistic treatment of uncertainty in AI. 

The paper is divided into 5 parts. Section 2 repeats the earlier derivation
(but with a small change and a correction) that Entropic Priors minimize mutual
information between the data and the parameters. Section 3 presents the
computation for discrete BBNs. Section 4 shows an application to
classification. Section 5 computes the priors for the Mixture of Gaussians
case. Finally, some general remarks and conclusions are included in Section 6.



2. Entropic Priors are Most Ignorant Priors 

Consider a regular parametric hypothesis space, i.e. a Riemannian manifold of
dominated probability distributions with volume element $g^{1/2}(\theta)\,d\theta$,
where $g(\theta)$ is the determinant of the Fisher information at $\theta$. We
denote by $f(x|\theta)$ the density (with respect to either Lebesgue or counting
measure) of the distribution indexed by $\theta$ and by $\pi(\theta)$ a prior
density on the parameters $\theta$. The entropic prior is the $\pi$ that makes
the joint distribution

$$ f(x_1,\ldots,x_\alpha,\theta) = \pi(\theta) \prod_{i=1}^{\alpha} f(x_i|\theta) \tag{1} $$

hardest to discriminate (in the sense of minimizing the Kullback number) from
the independent model,

$$ h(x_1,\ldots,x_\alpha)\, c\, g^{1/2}(\theta) \propto g^{1/2}(\theta) \left[ \prod_{i=1}^{\alpha} h(x_i) \right] \tag{2} $$

for a given fixed density $h(x)$ on the data space, where $c$ is a normalization
constant independent of $\theta$ and the $x_j$'s. Notice that $c > 0$ when the
parameter space has finite volume. However, the solution to the optimization
problem (5) (and hence, the entropic prior) does not depend on $c$ and still
makes sense for models with infinite volume. Notice further that the setting is
coherent in the sense that the rhs of (2) is in fact proportional to the density
of the model that assigns probabilities to the $x_j$'s according to $h$ and
independently of the $\theta$ which, according to (2), is uniform over the
surface area of the model. This is true since the Fisher information in the
hypothesis space of $\alpha$ independent observations is $\alpha$ times the
Fisher information in the hypothesis space of one observation, and thus the
volume element in the space of $\alpha$ observations is
$\alpha^{k/2} g^{1/2}(\theta)$, where $k$ is the dimension of the parameter
space; i.e., the two volume elements are proportional and we assume the
proportionality constant is included in $c$.
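To simplify the notation let $x^\alpha = (x_1,\ldots,x_\alpha)$ and write,

$$ I(\theta : h) = \int f(x|\theta) \log \frac{f(x|\theta)}{h(x)}\, dx \tag{3} $$

and

$$ I(f\pi : h g^{1/2}) = \int\!\!\int f(x^\alpha,\theta) \log \frac{f(x^\alpha,\theta)}{h(x^\alpha)\, c\, g^{1/2}(\theta)}\, dx^\alpha\, d\theta \tag{4} $$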



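We have,

Theorem 1

$$ \pi^* = \arg\min_{\pi} I(f\pi : h g^{1/2}) \tag{5} $$

where the minimum is taken over all the proper priors on the parameter space, is
given by the entropic prior:

$$ \pi^*(\theta|\alpha,h) \propto e^{-\alpha I(\theta : h)}\, g^{1/2}(\theta) \tag{6} $$

Proof Using Fubini's theorem, the definitions above, and the fact that $\pi$
integrates to one, we can write

$$ I(f\pi : h g^{1/2}) = \alpha \int \pi(\theta) I(\theta : h)\, d\theta + \int \pi(\theta) \log \frac{\pi(\theta)}{g^{1/2}(\theta)}\, d\theta - \log c \tag{7} $$

Therefore using a Lagrange multiplier $\lambda$ to enforce the normalization
constraint ($\int \pi = 1$) we can find $\pi^*$ by solving:

$$ \arg\min_{\pi} \int \left\{ \alpha\, \pi(\theta) I(\theta : h) + \pi(\theta) \log \frac{\pi(\theta)}{g^{1/2}(\theta)} + \lambda \pi(\theta) \right\} d\theta \tag{8} $$

Let $L(\pi,\lambda)$ denote the expression inside the curly brackets in (8). The
Euler-Lagrange equation for the optimal $\pi^*$ is $\partial L / \partial \pi = 0$,
given by,

$$ \alpha I + \log \pi^* - \log g^{1/2} + \lambda + 1 = 0. \tag{9} $$

From where we obtain the expression for the entropic prior given by (6).
Q.E.D.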

2.1. BUT WHAT DOES IT MEAN? 

First of all it needs to be clear that the above analysis is logically a priori.
By this I mean that the actual numerical values of the observed data are not
used, nor is the actual sample size $n$ of observed i.i.d. data vectors. The
parameters $\alpha$ and $h$ of the entropic prior are the carriers of prior
information. Notice also that, since the derivation was done on a virtual and
not an actual space of $\alpha$ observations, it makes sense to allow $\alpha$
to take non integer values as long as $\alpha > 0$. In fact an irrational
$\alpha'$ is immediately obtained if we decide to change (in the final formula
for the entropic prior) the entropy scale to bits by changing the original base
of the logarithm in $I(\theta : h)$ from $e$ to 2 so that
$\alpha' = \alpha \log 2$. It is however incorrect to claim that by starting the
derivation with another base for the logarithm one will end up with a non
integer $\alpha'$, as was wrongly claimed in earlier work. In fact the objective
functions are proportional and they obviously produce the same $\pi^*$. To see
the source of the mistake one just needs to notice that when the base of the log
in (9) is 2, say, one has to exponentiate 2, and not $e$, in order to solve for
$\pi^*$. This was first pointed out to me by Ariel Caticha, who then tried to
build a justification for an entropic prior with fixed $\alpha = 1$.



2.1.1. Imaginary a 

Allowing a to be not just a real number but a Clifford number, in particular 
to be a pseudo scalar, opens up a garden of unexplored possibilities. This may 
not be as insane as it first appears to be, if one thinks of the resulting prior 
as the density of a Clifford valued probability measure (see Q]). Moreover, if / 
(entropy) could be justified as S (action) then the resulting prior e^^^^ (relative to 
local ignorance) would take a familiar form. Going with the flow of this (for now) 
applied numerology this would point to current physical theory to be based on 
the order of 10^^ equivalent a priori observations! (i.e. expressing h in geometrized 
units). 



2.2. RECIPES FOR CHOOSING a AND h 



The values of the hyperparameters $\alpha$ and $h$ of the entropic prior need to
be fixed in order to obtain numerical assignments of probabilities. To fix $h$
we need to specify a function (i.e. an a priori density $h(x)$ for the data)
which involves, in principle, the specification of an infinite number of
parameters. Nevertheless, the importance of the a priori biases introduced by
$h$ is modulated by the value of the real positive parameter $\alpha$. Take
$\alpha$ sufficiently close to 0 and the prior will be blind to the specific
form of $h$ and controlled by the volume element $g^{1/2}d\theta$ (i.e. uniform
over the model surface). There is a close similarity with the problem of
choosing a kernel and a bandwidth in density estimation. As is the case in
density estimation, the specific form of the kernel is not as critical as the
choice of the smoothness parameter. A natural choice for $h$ is to use
$h(x) = f(x|\theta_0)$ where $\theta_0$ is the best current guess for the value
of $\theta$. If we assume the value of $\theta_0$ to be unknown then we can
consider the entropic prior model, which is now indexed by the $1 + k$
parameters $(\alpha, \theta_0)$, to be another regular hypothesis space that
needs a prior on its parameters. The entropic prior on the entropic prior, on
the entropic prior, etc. is, in principle, computable. The possibility of a
chain of entropic priors for $\alpha$ was first explored to first level, and
later for all levels, in earlier work. Another general alternative is to use the
empirical Bayes approach. Finally, just fixing $\alpha$ to an arbitrary small
value ($\ll 1$) and using the MLE (maximum likelihood estimator) or MAP (maximum
a posteriori probability) estimate, with an easy to handle conjugate prior, for
$\theta$ has been shown to perform well in simulation experiments.
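
The role of $\alpha$ can be illustrated on the simplest possible model. The
following is only a sketch, assuming a Bernoulli likelihood with
$h(x) = f(x|\theta_0)$, for which $I(\theta : h)$ is the Kullback number between
two Bernoulli distributions and $g^{1/2}(\theta) = 1/\sqrt{\theta(1-\theta)}$.
The function names, the grid and the values of $\alpha$ and $\theta_0$ are
choices made here for illustration and do not come from the paper.

```python
import numpy as np

def entropic_prior_bernoulli(theta, alpha, theta0):
    """Unnormalized entropic prior exp(-alpha * I(theta:h)) * g^{1/2}(theta)
    for a Bernoulli model with h = f(.|theta0) (illustrative assumption)."""
    kl = (theta * np.log(theta / theta0)
          + (1.0 - theta) * np.log((1.0 - theta) / (1.0 - theta0)))
    sqrt_fisher = 1.0 / np.sqrt(theta * (1.0 - theta))
    return np.exp(-alpha * kl) * sqrt_fisher

# Discretize (0, 1) and normalize the weights on the grid.
grid = np.linspace(0.001, 0.999, 999)

def normalized(alpha, theta0):
    w = entropic_prior_bernoulli(grid, alpha, theta0)
    return w / w.sum()

# alpha close to 0: the prior is controlled by the volume element alone.
flat = normalized(1e-6, 0.3)
jeffreys = 1.0 / np.sqrt(grid * (1.0 - grid))
jeffreys /= jeffreys.sum()

# Large alpha: the prior concentrates around the initial guess theta0.
sharp = normalized(50.0, 0.3)
```

With $\alpha$ near 0 the weights coincide with the normalized volume element,
while with a large $\alpha$ the mass concentrates near the guess $\theta_0$.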
Figure 1. DAG for the Sprinkler Problem

3. The Entropic Prior of a Discrete Probabilistic Network 

An understanding of Cox's argument should be sufficient to impose the rules
of probability on the treatment of uncertainty in AI. It has taken, however,
a long heated debate, the invention of new efficient methods of computation
(e.g. the junction tree algorithm), and the publication of Pearl's text to
arrive at today's dominant view of a complete probabilistic approach.

3.1. DAGS 

The current recipe for the thinking machine consists of a fully Bayesian
probabilistic treatment of a long vector of facts (the data). The main approach
for encoding prior information about a specific domain of application is not
the prior, but the likelihood. An a priori network of conditional independence
assumptions is typically provided by means of a Directed Acyclic Graph (DAG)
that is supposed to encode an expert's knowledge of causal relations among
observable facts.

The canonical textbook example is displayed in fig. 1. The arrows indicate
causality. Thus, the presence of the arrow from Cloudy to Rain represents the
fact that the sky being cloudy is a possible cause for rain. More important is
the absence of arrows, which indicates independence. Thus, the picture shows
that conditionally on the values of Sprinkler and Rain, Cloudy is independent of
WetGrass. The entries of the tables of conditional probabilities constitute the
parameters of the DAG. In the case of fig. 1 there are 9 independent parameters.

Figure 2. Example of a DAG

We can think of a DAG as a convenient way to specify a high dimensional
submanifold of the space of all joint distributions of the variables under
consideration. For example, the pictured DAG (with unspecified tables)
represents a 9 dimensional submanifold of the 15 dimensional simplex of all the
assignments of probability on the $2^4 = 16$ possible observations of the binary
variables $(C, S, R, W)$. The DAG in fig. 1 specifies the joint distribution of
all the variables $(C, S, R, W)$ in terms of the parameters $\theta$ (i.e. table
entries) as,

$$ P(C, R, S, W) = P(C)\,P(R|C)\,P(S|R)\,P(W|R,S). \tag{10} $$

Each of the factors on the right of equation (10) can be read off the tables
provided in fig. 1. For example,

$$ P(C = T, R = T, S = F, W = F) = (0.5)(0.8)(0.5)(0.1) = 0.02 \tag{11} $$

In order to provide general formulas for DAGs we number the vector of variables
by $x = (x_1, x_2, x_3, x_4) = (C, R, S, W)$ and parameterize the joint
distribution with a vector $\theta$ of parameters as in,

$$ \theta_{4w}(r,s) = P(W = w|R = r, S = s) \tag{12} $$

Thus, labeling F = 1 and T = 2, (11) becomes,

$$ P(2, 2, 1, 1|\theta) = \theta_{12}\,\theta_{22}(2)\,\theta_{31}(2)\,\theta_{41}(2,1) \tag{13} $$
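
The factorization (10) is easy to evaluate mechanically. The sketch below is
only illustrative: the four table entries used in the worked example (11) are
taken from the text, while every other entry of the tables is a hypothetical
value chosen here, since the tables of fig. 1 are not reproduced in the text.

```python
from itertools import product

# Conditional probability tables; True = T, False = F.
# Taken from (11): P(C=T)=0.5, P(R=T|C=T)=0.8, P(S=F|R=T)=0.5,
# P(W=F|R=T,S=F)=0.1. All remaining entries are hypothetical.
P_C = {True: 0.5, False: 0.5}
P_R_GIVEN_C = {True: {True: 0.8, False: 0.2},     # P_R_GIVEN_C[c][r]
               False: {True: 0.2, False: 0.8}}
P_S_GIVEN_R = {True: {True: 0.5, False: 0.5},     # P_S_GIVEN_R[r][s]
               False: {True: 0.3, False: 0.7}}
P_W_TRUE_GIVEN_RS = {(True, True): 0.99, (True, False): 0.9,
                     (False, True): 0.9, (False, False): 0.0}

def joint(c, r, s, w):
    """P(C, R, S, W) computed as the product of factors in (10)."""
    p_w_true = P_W_TRUE_GIVEN_RS[(r, s)]
    p_w = p_w_true if w else 1.0 - p_w_true
    return P_C[c] * P_R_GIVEN_C[c][r] * P_S_GIVEN_R[r][s] * p_w

# The 16 joint probabilities must add up to one.
total = sum(joint(*x) for x in product([True, False], repeat=4))
example = joint(True, True, False, False)  # the configuration of (11)
```

Here `example` reproduces the product in (11), and `total` confirms that the
tables define a point of the 15 dimensional simplex.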

3.2. WHO IS WHO ON A DAG: GENERAL NOTATION 



This section provides some definitions and notations that are needed for writing
the entropic prior on a general DAG. All the examples refer to fig. 2.

Directed Graph: An ordered pair $(V, E)$ where $V$ is a set of vertices (e.g.
$V = \{1, 2, 3, 4, 5\}$) and $E \subset V \times V$ is a set of directed edges,
e.g.,

$$ E = \{(1,2), (1,3), (1,4), (2,4), (3,4), (4,5)\} $$

DAG: A Directed Acyclic Graph is a directed graph without cycles (e.g. fig. 2).

Parents: $\mathrm{pa}(k)$ denotes the set of parents of the vertex $k \in V$
(e.g. $\mathrm{pa}(1) = \emptyset$, $\mathrm{pa}(5) = \{4\}$,
$\mathrm{pa}(4) = \{1, 2, 3\}$).

Ancestors: $\mathrm{an}(k)$ denotes the set of ancestors of $k \in V$ (e.g.
$\mathrm{an}(2) = \{1\}$, $\mathrm{an}(5) = \{1, 2, 3, 4\}$,
$\mathrm{an}(1) = \emptyset$). Clearly,

$$ \mathrm{an}(k) = \mathrm{pa}(k) \cup \bigcup_{j \in \mathrm{pa}(k)} \mathrm{an}(j) \tag{14} $$

Ancestors that are not Parents: Denoted by $\mathrm{ap}(k)$,

$$ \mathrm{ap}(k) = \mathrm{an}(k) \setminus \mathrm{pa}(k) \tag{15} $$

(e.g. $\mathrm{ap}(5) = \{1, 2, 3\}$, $\mathrm{ap}(4) = \emptyset$,
$\mathrm{ap}(2) = \emptyset$).

Notation:

$$ x_{\mathrm{pa}(k)} = \{x_j : j \in \mathrm{pa}(k)\} \tag{16} $$

e.g. $x_{\mathrm{pa}(1)} = \emptyset$, $x_{\mathrm{pa}(4)} = \{x_1, x_2, x_3\}$.

Notation: $\sum_{x_{\mathrm{pa}(k)}}$ denotes the multiple sum over all the
possible values of the variables that are parents of vertex $k \in V$, e.g.

$$ \sum_{x_{\mathrm{pa}(4)}} = \sum_{x_1} \sum_{x_2} \sum_{x_3} $$

The notation introduced with equation (13) generalizes naturally for any number
of discrete variables. Given a DAG with set of vertices $V$ we let
$x = \{x_k : k \in V\}$. Hence, the joint distribution of the variables of a
given DAG is given by,

$$ p(x|\theta) = \prod_{k \in V} p(x_k|x_{\mathrm{pa}(k)}, \theta) = \prod_{k \in V} \theta_{k x_k}(x_{\mathrm{pa}(k)}) \tag{17} $$



We are now ready to compute. 
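
The sets defined above can be computed directly from the edge list. The
following is a minimal sketch using the edge set $E$ of the example; the
function names are chosen here for illustration, and the recursion in
`ancestors` is a direct transcription of (14), which terminates because the
graph has no cycles.

```python
def parents(edges, k):
    """pa(k): vertices with a directed edge into k."""
    return {i for (i, j) in edges if j == k}

def ancestors(edges, k):
    """an(k) via the recursion (14): parents plus the ancestors of each parent."""
    result = set()
    for j in parents(edges, k):
        result |= {j} | ancestors(edges, j)
    return result

def ancestors_not_parents(edges, k):
    """ap(k) = an(k) minus pa(k), as in (15)."""
    return ancestors(edges, k) - parents(edges, k)

# Edge set of the example DAG.
E = {(1, 2), (1, 3), (1, 4), (2, 4), (3, 4), (4, 5)}
```

For instance, `ancestors_not_parents(E, 5)` gives the set `{1, 2, 3}` quoted in
the text.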



3.3. ENTROPY OF A DAG 

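Given a DAG, the Kullback number between two sets of parameters $\theta$ and
$\mu$ is

$$ I(\theta : \mu) = E_\theta \left[ \log \frac{p(x|\theta)}{p(x|\mu)} \right] \tag{18} $$

Using (17) and interchanging expectation with summation we obtain,

$$ I(\theta : \mu) = \sum_{k \in V} E_\theta \left[ \log \frac{\theta_{k x_k}(x_{\mathrm{pa}(k)})}{\mu_{k x_k}(x_{\mathrm{pa}(k)})} \right] \tag{19} $$

Now for each $k \in V$ compute the unconditional expectation in (19) by first
conditioning on the values of $x_{\mathrm{pa}(k)}$ to obtain,

$$ E_\theta \left[ \log \frac{\theta_{k x_k}(x_{\mathrm{pa}(k)})}{\mu_{k x_k}(x_{\mathrm{pa}(k)})} \,\Big|\, x_{\mathrm{pa}(k)} \right] = \sum_{j=1}^{r_k} \theta_{kj}(x_{\mathrm{pa}(k)}) \log \frac{\theta_{kj}(x_{\mathrm{pa}(k)})}{\mu_{kj}(x_{\mathrm{pa}(k)})} = I(\theta_k(x_{\mathrm{pa}(k)}) : \mu_k(x_{\mathrm{pa}(k)})) \tag{20} $$

where the last equality is a definition and it was assumed that $x_k$ can take
$r_k$ discrete values. Taking expectations over the $x_{\mathrm{pa}(k)}$ and
replacing in (19) we obtain,

$$ I(\theta : \mu) = \sum_{k \in V} \sum_{x_{\mathrm{pa}(k)}} p(x_{\mathrm{pa}(k)}|\theta)\, I(\theta_k(x_{\mathrm{pa}(k)}) : \mu_k(x_{\mathrm{pa}(k)})). \tag{21} $$

Finally, using the fact that,

$$ p(x_{\mathrm{pa}(k)}|\theta) = \sum_{x_{\mathrm{ap}(k)}} p(x_{\mathrm{ap}(k)}, x_{\mathrm{pa}(k)}|\theta) = \sum_{x_{\mathrm{ap}(k)}} \prod_{j \in \mathrm{an}(k)} \theta_{j x_j}(x_{\mathrm{pa}(j)}) \tag{22} $$

we obtain the expression for the entropy,

$$ I(\theta : \mu) = \sum_{k \in V} \sum_{x_{\mathrm{pa}(k)}} \left\{ \sum_{x_{\mathrm{ap}(k)}} \prod_{j \in \mathrm{an}(k)} \theta_{j x_j}(x_{\mathrm{pa}(j)}) \right\} I(\theta_k(x_{\mathrm{pa}(k)}) : \mu_k(x_{\mathrm{pa}(k)})). \tag{23} $$

Thus, formula (23) shows that the total entropy for a DAG is obtained by adding
the entropies for each node. The entropy of a node is computed as an average of
all the possible entropies obtained for the different values of the parents of
that node. In practice formula (23) may be too expensive to compute and it may
be necessary to use a Monte Carlo estimate.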
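
The node by node decomposition can be checked numerically on a small network.
The sketch below assumes a hypothetical three node binary DAG and random
tables (none of this comes from the paper); it compares the definition (18),
evaluated by brute force over all joint states, with the decomposition (21). The
parent marginals are obtained here by summing the full joint, which is simpler
than the ancestor formula (22) but only feasible for tiny networks.

```python
import itertools
import numpy as np

# Hypothetical 3-node binary DAG: 0 -> 1, and (0, 1) -> 2.
PARENTS = {0: (), 1: (0,), 2: (0, 1)}
R = {0: 2, 1: 2, 2: 2}  # number of values taken by each node

def parent_configs(k):
    return itertools.product(*(range(R[j]) for j in PARENTS[k]))

def random_tables(rng):
    """One probability vector theta_k(x_pa(k)) per node and parent configuration."""
    return {(k, cfg): rng.dirichlet(np.ones(R[k]))
            for k in PARENTS for cfg in parent_configs(k)}

def joint(x, tables):
    """p(x|theta) as the product (17)."""
    p = 1.0
    for k, pa in PARENTS.items():
        p *= tables[(k, tuple(x[j] for j in pa))][x[k]]
    return p

def all_states():
    return itertools.product(*(range(R[k]) for k in sorted(PARENTS)))

def entropy_brute_force(theta, mu):
    """Definition (18): expected log ratio of the two joints under theta."""
    return sum(joint(x, theta) * np.log(joint(x, theta) / joint(x, mu))
               for x in all_states())

def entropy_by_nodes(theta, mu):
    """Decomposition (21): parent-weighted sum of local Kullback numbers."""
    total = 0.0
    for k, pa in PARENTS.items():
        for cfg in parent_configs(k):
            p_cfg = sum(joint(x, theta) for x in all_states()
                        if tuple(x[j] for j in pa) == cfg)
            t, m = theta[(k, cfg)], mu[(k, cfg)]
            total += p_cfg * float(np.sum(t * np.log(t / m)))
    return total

rng = np.random.default_rng(0)
theta, mu = random_tables(rng), random_tables(rng)
```

Both functions should return the same number up to rounding, and the entropy of
a set of tables relative to itself should vanish.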



3.4. VOLUME ELEMENT OF A DAG 



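To compute the Fisher metric, write $\theta$ as a long vector and use the fact
that,

$$ I(\theta : \theta + \epsilon v) = \frac{\epsilon^2}{2} \sum_{i,j} g_{ij}(\theta)\, v^i v^j + o(\epsilon^2) \tag{24} $$

It then follows immediately from (23) that the Fisher matrix is block diagonal.
Each block corresponds to the $(r_k - 1) \times (r_k - 1)$ Fisher matrix
$G_k(\theta_k(x_{\mathrm{pa}(k)}))$ associated to the $k$th node, multiplied by
the scalar $p(x_{\mathrm{pa}(k)}|\theta)$. The determinant, $g(\theta)$, of the
Fisher matrix is then given by the product of the determinants of each of the
blocks. We have

$$ g(\theta) = \prod_{k \in V} \prod_{x_{\mathrm{pa}(k)}} \left\{ \sum_{x_{\mathrm{ap}(k)}} \prod_{j \in \mathrm{an}(k)} \theta_{j x_j}(x_{\mathrm{pa}(j)}) \right\}^{r_k - 1} \det G_k(\theta_k(x_{\mathrm{pa}(k)})) \tag{25} $$

Finally using the fact that $G_k$ is the Fisher matrix of a multinomial with
parameters $\theta_{k1}(x_{\mathrm{pa}(k)}), \ldots, \theta_{k r_k}(x_{\mathrm{pa}(k)})$
we have,

$$ \det G_k(\theta_k(x_{\mathrm{pa}(k)})) = \frac{1}{\prod_{j=1}^{r_k} \theta_{kj}(x_{\mathrm{pa}(k)})} \tag{26} $$

replacing (26) in (25) and taking the square root we obtain the expression for
the volume element,

$$ g^{1/2}(\theta)\, d\theta = \prod_{k \in V} \prod_{x_{\mathrm{pa}(k)}} \left\{ \frac{p(x_{\mathrm{pa}(k)}|\theta)^{(r_k - 1)/2}}{\left[ \prod_{j=1}^{r_k} \theta_{kj}(x_{\mathrm{pa}(k)}) \right]^{1/2}} \right\} d\theta \tag{27} $$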



3.5. THE ENTROPIC PRIOR FOR A DAG 

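To obtain (6) we use (23), (22) and (27) to get,

$$ \pi(\theta|\alpha,\mu) \propto \prod_{k \in V} \prod_{x_{\mathrm{pa}(k)}} \left[ \frac{p(x_{\mathrm{pa}(k)}|\theta)^{r_k - 1}}{\prod_{j=1}^{r_k} \theta_{kj}(x_{\mathrm{pa}(k)})} \right]^{1/2} \exp\left\{ -\alpha\, p(x_{\mathrm{pa}(k)}|\theta)\, I(\theta_k(x_{\mathrm{pa}(k)}) : \mu_k(x_{\mathrm{pa}(k)})) \right\} \tag{28} $$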



3.6. POSTERIOR 

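Let us assume that there is available a set of $N$ independent observations

$$ D = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\} \tag{29} $$

where each $x^{(t)} = (x_1^{(t)}, \ldots, x_n^{(t)})$ is an $|V| = n$
dimensional vector containing the observed values of the nodes of a general DAG
$(V, E)$. As usual the posterior is given by Bayes theorem as,

$$ \pi(\theta|D,\alpha,\mu) \propto f(D|\theta)\, \pi(\theta|\alpha,\mu) \tag{30} $$

where the likelihood is given by,

$$ f(D|\theta) = \prod_{t=1}^{N} p(x^{(t)}|\theta) = \prod_{t=1}^{N} \prod_{k=1}^{n} \theta_{k x_k^{(t)}}(x^{(t)}_{\mathrm{pa}(k)}) \tag{31} $$

Let us partition the set of vertices into two groups, those with parents and
those without (orphans). For the orphan nodes, i.e. for $k \in V$ such that
$\mathrm{pa}(k) = \emptyset$ and for $i = 1, 2, \ldots, r_k$ define

$$ n_{ki}(\emptyset) = |\{t : x_k^{(t)} = i\}| \tag{32} $$

and for $k \in V$ with $\mathrm{pa}(k) \neq \emptyset$ and
$i = 1, 2, \ldots, r_k$,

$$ n_{ki}(x_{\mathrm{pa}(k)}) = |\{t : x_k^{(t)} = i \text{ and } x^{(t)}_{\mathrm{pa}(k)} = x_{\mathrm{pa}(k)}\}| \tag{33} $$

Replacing these counts into (31) we obtain,

$$ f(D|\theta) = \prod_{k=1}^{n} \prod_{x_{\mathrm{pa}(k)}} \prod_{i=1}^{r_k} \left\{ \theta_{ki}(x_{\mathrm{pa}(k)}) \right\}^{n_{ki}(x_{\mathrm{pa}(k)})} \tag{34} $$

To simplify the notation let us denote by $p_k$ the expression (22), which is
always a probability that depends only on the ancestors of the node $k$. Let us
also just write $\theta_{ki}$, $n_{ki}$, $\mu_{ki}$ instead of
$\theta_{ki}(x_{\mathrm{pa}(k)}), \ldots$ and keep implicit their dependence on
given values of the parents. With this notation the posterior becomes,

$$ \pi(\theta|D,\alpha,\mu) \propto \prod_{k \in V} \prod_{x_{\mathrm{pa}(k)}} p_k^{(r_k - 1)/2} \prod_{i=1}^{r_k} \theta_{ki}^{\,n_{ki} - 1/2} \exp\left( -\alpha\, p_k\, \theta_{ki} \log \frac{\theta_{ki}}{\mu_{ki}} \right) \tag{35} $$

where we have used (20) to write the exponential in (28) as a product of $r_k$
factors.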
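
The counts (32) and (33) are the only way in which the data enter the
likelihood (34). As a rough sketch, they can be accumulated in a single pass
over the records; the data layout (tuples indexed by node), the small three
node network and the function name below are assumptions made here for
illustration.

```python
from collections import Counter

def sufficient_counts(data, parents):
    """Return counts keyed by (k, x_pa(k), i): the number of records with
    x_k = i whose parent values equal x_pa(k). Orphans use the empty tuple."""
    counts = Counter()
    for record in data:
        for k, pa in parents.items():
            counts[(k, tuple(record[j] for j in pa), record[k])] += 1
    return counts

# Hypothetical network: node 0 is the only parent of nodes 1 and 2.
PARENTS = {0: (), 1: (0,), 2: (0,)}
DATA = [(1, 0, 1), (1, 0, 0), (0, 1, 1), (1, 0, 1)]
counts = sufficient_counts(DATA, PARENTS)
```

A `Counter` returns zero for configurations that never occur, which matches
the convention that such factors contribute nothing to (34).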



4. Example: Naive Bayes 

When the DAG has the form shown in fig. 3 the general formulas have simpler
forms. This case is known as naive Bayes and it is often used as an
approximation in discrimination problems.

Figure 3. DAG for Naive Bayes

For this case, $V = \{1, \ldots, n\}$, $\mathrm{pa}(1) = \emptyset$, and for
$k \neq 1$ we have $\mathrm{pa}(k) = \{1\}$, $\mathrm{an}(k) = \{1\}$,
$\mathrm{ap}(k) = \emptyset$ and,

$$ p(x_{\mathrm{pa}(k)}|\theta) = \theta_{1 x_1} \tag{36} $$

The expression for the entropy (23) becomes,

$$ I(\theta : \mu) = I(\theta_1 : \mu_1) + \sum_{k=2}^{n} \sum_{j=1}^{r_1} \theta_{1j}\, I(\theta_k(j) : \mu_k(j)) \tag{37} $$

and the volume element (27) reduces to,

$$ g^{1/2}(\theta)\, d\theta = \left\{ \frac{\prod_{j=1}^{r_1} \theta_{1j}^{\,\sum_{k=2}^{n} (r_k - 1) - 1}}{\prod_{j=1}^{r_1} \prod_{k=2}^{n} \prod_{i=1}^{r_k} \theta_{ki}(j)} \right\}^{1/2} d\theta \tag{38} $$

The entropic prior is then easily computed by multiplying
$\exp(-\alpha I(\theta : \mu))$ (obtained from (37)) by (38).



4.1. POSTERIOR 

For naive Bayes the likelihood is given by,

$$ f(D|\theta) = \prod_{t=1}^{N} \theta_{1 x_1^{(t)}} \prod_{k=2}^{n} \theta_{k x_k^{(t)}}(x_1^{(t)}) \tag{39} $$






C. C. RODRIGUEZ 



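Replacing the counts into (39) we obtain,

$$ f(D|\theta) = \left( \prod_{j=1}^{r_1} \theta_{1j}^{\,n_{1j}} \right) \left( \prod_{j=1}^{r_1} \prod_{k=2}^{n} \prod_{i=1}^{r_k} \left( \theta_{ki}(j) \right)^{n_{ki}(j)} \right) \tag{40} $$

Letting,

$$ m = \frac{\sum_{k=2}^{n} (r_k - 1) - 1}{2} \tag{41} $$

we can write the posterior as,

$$ \pi(\theta|D,\alpha,\mu) \propto \left\{ \prod_{j=1}^{r_1} \theta_{1j}^{\,m + n_{1j}} \exp\left[ -\left( \alpha \log \frac{\theta_{1j}}{\mu_{1j}} \right) \theta_{1j} \right] \right\} \prod_{j=1}^{r_1} \prod_{k=2}^{n} \prod_{i=1}^{r_k} \theta_{ki}(j)^{\,n_{ki}(j) - 1/2} \exp\left[ -\left( \alpha\, \theta_{1j} \log \frac{\theta_{ki}(j)}{\mu_{ki}(j)} \right) \theta_{ki}(j) \right] \tag{42} $$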

4.2. THE ENTROPIC SAMPLER 

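A combination of Gibbs and Metropolis can be used for sampling the posterior
(42). The parameters are naturally grouped in blocks $\theta_k$, where,

$$ \theta_k = \theta_k(x_{\mathrm{pa}(k)}) = (\theta_{k1}, \ldots, \theta_{k r_k}) \quad \text{with} \quad \sum_{i=1}^{r_k} \theta_{ki} = 1 \tag{43} $$

are distributed over the simplex of dimension $r_k - 1$. It can be readily seen
from (42) that the marginal joint distributions of the $\theta_k$ blocks are
all of the generic form,

$$ f(y_1, y_2, \ldots, y_{r-1}) \propto \prod_{j=1}^{r} \left\{ y_j^{\,a_j - 1} e^{-\beta_j y_j} \right\} \tag{44} $$

with $y_j > 0$ and $y_r = 1 - \sum_{j=1}^{r-1} y_j$. The parameters $a_j$ and
$\beta_j$ are different for the parent node and for the children nodes. For the
parent,

$$ a_j = 1 + m + n_{1j} - \alpha\theta_{1j} \approx 1 + m + n_{1j} \tag{45} $$

$$ \beta_j = \alpha \left( \log \frac{1}{\mu_{1j}} + \sum_{k=2}^{n} I(\theta_k(j) : \mu_k(j)) \right) \tag{46} $$

For the children blocks $\theta_k(j)$ the parameters are, for
$i = 1, \ldots, r_k$,

$$ a_i = n_{ki}(j) + \frac{1}{2} - \alpha\, \theta_{1j}\, \theta_{ki}(j) \approx n_{ki}(j) + \frac{1}{2} \tag{47} $$

$$ \beta_i = \alpha\, \theta_{1j} \log \frac{1}{\mu_{ki}(j)} \tag{48} $$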
Excellent initial distributions for Metropolis are obtained by using the following, 
Lemma 1 Let $y_1, y_2, \ldots, y_r$ be independent with $y_j$ following a Gamma
distribution with parameters $(a_j, \beta_j)$. Let,

$$ z_j = \frac{y_j}{y_1 + \cdots + y_r} \quad \text{for } j = 1, \ldots, r-1 \tag{49} $$

then the joint density of the $z_j$'s is given by,

$$ f(z_1, \ldots, z_{r-1}) \propto \frac{\prod_{j=1}^{r} z_j^{\,a_j - 1}}{(\beta_1 z_1 + \cdots + \beta_r z_r)^{a_1 + a_2 + \cdots + a_r}} \tag{50} $$

where $z_r = 1 - z_1 - z_2 - \cdots - z_{r-1}$.

Proof Notice that (50) is a generalization of the classic result for the
Dirichlet distribution obtained when all the $\beta_j$'s are equal, in which
case the denominator becomes proportional to 1. To prove (50) just condition on
$y_r = y$ so that the transformation (49) from the $y_j$'s to the $z_j$'s for
$j = 1, 2, \ldots, r-1$ is one to one with inverse,

$$ y_j = \frac{y\, z_j}{z_r} \quad \text{for } j = 1, \ldots, r-1 \tag{51} $$

To show (51) just notice that,

$$ y\, z_j = y\, \frac{y_j}{y + \sum_{i=1}^{r-1} y_i} \tag{52} $$

$$ = y_j \left( \frac{y}{y + \sum_{i=1}^{r-1} y_i} \right) \tag{53} $$

$$ = y_j \left( 1 - \sum_{i=1}^{r-1} z_i \right) \tag{54} $$

$$ = y_j\, z_r \tag{55} $$

where we have used (49) and the definition of $z_r$. The probability density of
observing $z_1, \ldots, z_{r-1}$ is then,

$$ f(z_1, \ldots, z_{r-1}) = \int_0^\infty f(z_1, \ldots, z_{r-1}|y_r = y)\, g_r(y)\, dy \tag{56} $$

where $g_j$ for $j = 1, \ldots, r$ are the gamma densities of the $y_j$. Using
the definition of the $z_j$'s given in (49), the assumed independence of the
$y_j$'s, and the change of variables theorem together with (51), we have,

$$ f(z_1, \ldots, z_{r-1}|y_r = y) = \frac{y^{r-1}}{z_r^{\,r}} \prod_{j=1}^{r-1} g_j\!\left( \frac{y\, z_j}{z_r} \right) \tag{57} $$
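The expression outside the product is the determinant of the Jacobian of the
transformation (51). This can be seen by noticing that the Jacobian matrix is,

$$ J = \frac{y}{z_r^2} \begin{pmatrix} z_1 + z_r & z_1 & \cdots & z_1 \\ z_2 & z_2 + z_r & \cdots & z_2 \\ \vdots & \vdots & \ddots & \vdots \\ z_{r-1} & z_{r-1} & \cdots & z_{r-1} + z_r \end{pmatrix} \tag{58} $$

and compute its determinant by subtracting from each column the column that
follows, to obtain,

$$ \det J = \left( \frac{y}{z_r^2} \right)^{r-1} \det \begin{pmatrix} z_r & 0 & \cdots & 0 & z_1 \\ -z_r & z_r & \cdots & 0 & z_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & -z_r & z_{r-1} + z_r \end{pmatrix} \tag{59} $$

and expanding along the last column,

$$ \det J = \left( z_1 z_r^{\,r-2} + z_2 z_r^{\,r-2} + \cdots + z_{r-2} z_r^{\,r-2} + (z_{r-1} + z_r) z_r^{\,r-2} \right) \left( \frac{y}{z_r^2} \right)^{r-1} = \frac{y^{r-1}}{z_r^{\,r}} \tag{60} $$

This proves (57). Replacing (57) into (56) and using the expressions for the
gamma densities we obtain,

$$ f(z_1, \ldots, z_{r-1}) \propto \int_0^\infty \frac{y^{r-1}}{z_r^{\,r}} \left\{ \prod_{j=1}^{r-1} \left( \frac{y\, z_j}{z_r} \right)^{a_j - 1} e^{-\beta_j y z_j / z_r} \right\} y^{a_r - 1} e^{-\beta_r y}\, dy \tag{61} $$

this is a simple gamma integral. Integrating out $y$ and simplifying the
$z_r$'s we obtain the desired result (50).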

Q.E.D. 
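
Lemma 1 can be checked by simulation in the case $r = 2$, where (50) is a
density on the unit interval. The sketch below is only a sanity check under
stated assumptions: the Gamma parameters $(a_j, \beta_j)$ are read as shape and
rate, and the particular values, sample size and grid are arbitrary choices
made here.

```python
import numpy as np

def lemma_density_r2(z, a, beta):
    """Unnormalized density (50) for r = 2 as a function of z_1 = z."""
    z1, z2 = z, 1.0 - z
    return (z1 ** (a[0] - 1) * z2 ** (a[1] - 1)
            / (beta[0] * z1 + beta[1] * z2) ** (a[0] + a[1]))

a = np.array([2.0, 3.0])      # shapes
beta = np.array([1.0, 4.0])   # rates (assumed parameterization)

# Monte Carlo: normalize independent Gamma draws as in (49).
rng = np.random.default_rng(1)
y = rng.gamma(shape=a, scale=1.0 / beta, size=(200_000, 2))
z1_samples = y[:, 0] / y.sum(axis=1)
mean_mc = z1_samples.mean()

# Numerical mean of z_1 under (50), midpoint rule on 1000 equal cells.
grid = np.linspace(0.0005, 0.9995, 1000)
w = lemma_density_r2(grid, a, beta)
mean_formula = np.sum(grid * w) / np.sum(w)
```

The two means should agree to within Monte Carlo error.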

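To generate approximate samples from (44) we use the Lemma but with Gamma
parameters $\tilde\beta_j$ chosen so that,

$$ \frac{1}{(\tilde\beta_1 z_1 + \cdots + \tilde\beta_r z_r)^{a_1 + \cdots + a_r}} \approx C \prod_{j=1}^{r} e^{-\beta_j z_j} \tag{62} $$

where the constant $C$ does not depend on the $z_j$. To find the
$\tilde\beta_j$ just write the left side of (62) in exponential form and use,

$$ \log(\tilde\beta_1 z_1 + \cdots + \tilde\beta_r z_r) = \log(\tilde\beta_r) + \log\left( 1 + \frac{\tilde\beta_1 - \tilde\beta_r}{\tilde\beta_r} z_1 + \cdots + \frac{\tilde\beta_{r-1} - \tilde\beta_r}{\tilde\beta_r} z_{r-1} \right) \tag{63} $$

together with,

$$ \log(1 + z) = z + o(z) \tag{64} $$

we obtain, that in order for (62) to be true, we must have,

$$ \left( \sum_{i=1}^{r} a_i \right) \frac{\tilde\beta_j - \tilde\beta_r}{\tilde\beta_r} = \beta_j - \beta_r \tag{65} $$

we can then use,

$$ \tilde\beta_r = \sum_{i=1}^{r} a_i \tag{66} $$

$$ \tilde\beta_j = \beta_j - \beta_r + \tilde\beta_r \tag{67} $$

Metropolis corrections are needed to correct for the approximations introduced
in (45), (47) and (64).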
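
One way to realize the Metropolis correction is an independence sampler that
proposes from the Lemma and accepts according to the ratio of target to
proposal. This is a sketch of that idea, not necessarily the scheme used for
the experiments below: it treats $a_j$ and $\beta_j$ as fixed within a block,
reads the Gamma parameters as shape and rate, and all function names and
numerical values are assumptions made here. Since the factors $z_j^{a_j - 1}$
appear in both the target (44) and the proposal (50), they cancel in the
weight.

```python
import numpy as np

def proposal_rates(a, beta):
    """Gamma rates for the proposal, following (66) and (67)."""
    rates = beta - beta[-1] + a.sum()
    if np.any(rates <= 0):
        raise ValueError("proposal rates must be positive")
    return rates

def log_weight(z, a, beta, rates):
    """log(target / proposal) up to an additive constant."""
    return -np.dot(beta, z) + a.sum() * np.log(np.dot(rates, z))

def sample_block(a, beta, n_samples, rng):
    """Independence Metropolis sampler for the density (44) on the simplex."""
    rates = proposal_rates(a, beta)

    def propose():
        y = rng.gamma(shape=a, scale=1.0 / rates)
        return y / y.sum()

    z = propose()
    out = np.empty((n_samples, len(a)))
    for t in range(n_samples):
        z_new = propose()
        log_ratio = (log_weight(z_new, a, beta, rates)
                     - log_weight(z, a, beta, rates))
        if np.log(rng.uniform()) < log_ratio:
            z = z_new
        out[t] = z
    return out

a = np.array([3.0, 5.0])
beta = np.array([0.5, 0.2])
draws = sample_block(a, beta, 20_000, np.random.default_rng(2))
```

When the $\beta_j$ are small relative to $\sum_i a_i$ the weights are nearly
constant and almost every proposal is accepted, which is consistent with the
observation that few correction steps are needed for small $\alpha$.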



4.3. TEST: CREDIT CARD CLASSIFICATION EXAMPLE 

We tested the performance of the MCMC sampler on a standard set of 10000 data
records containing the 13 variables in table 1. Most of the node names are self
explanatory. Card originally contained the type of credit card owned by the
individual with categories: no card, regular, gold, platinum. These were later
reduced to only two categories: {no card, regular} and {gold, platinum}. The
data contains individuals from the three North American countries: Mexico, US,
Canada. However, the majority of records are from the US. The Children home
variable contains information about the actual number of children living at
home with the individual.

    Nodes                 Sizes
    Card = C              2 (4)
    Gender = G            2
    Country = Y           3
    Age = A               9
    State = S             13
    Education = E         5
    Marital = M           2
    Occupation = O        5
    Total children = T    6
    Income = I            8
    House owner = H       2
    Cars owned = R        5
    Children home = N     6

TABLE 1. Data Records in Example



4.3.1. The Bayes Classifier 

To test the performance of the entropic sampler we chose at random 100
individuals to be used as the observed data and 1000 to test the Bayes
classifier. The Bayes classifier simply assigns the category with highest
posterior probability.

Let $D$ be the observed $N = 100$ records and let $x_2, \ldots, x_n$ (here
$n = 13$) be the values of all the nodes except the first (i.e. Card) for an
individual that we want to classify. The Bayes classifier allocates $x_1 = 1$
if,

$$ P(x_1 = 1|x_2, \ldots, x_n, D) > P(x_1 = 2|x_2, \ldots, x_n, D) \tag{68} $$

we compute both sides with,

$$ P(x_1 = j|x_2, \ldots, x_n, D) = \int P(x_1 = j, \theta|x_2, \ldots, x_n, D)\, d\theta \propto \int P(x_1 = j, x_2, \ldots, x_n, \theta|D)\, d\theta = \int p(x_1 = j, x_2, \ldots, x_n|\theta)\, \pi(\theta|D)\, d\theta \tag{69} $$

where we have assumed that the values of the individual to be classified are
independent of the observed data $D$. We use the MCMC sampler to estimate (69)
for $j = 1$ and $j = 2$. Thus, if the sampler produces
$\theta^{(1)}, \ldots, \theta^{(M)}$ samples from the posterior
$\pi(\theta|D)$ we classify $x_1 = 1$ if,

$$ \sum_{t=1}^{M} p(1, x_2, \ldots, x_n|\theta^{(t)}) > \sum_{t=1}^{M} p(2, x_2, \ldots, x_n|\theta^{(t)}) \tag{70} $$

To avoid underflows it is better to use only ratios. A more stable rule is then:
assign $x_1 = 1$ if,

$$ \sum_{t=1}^{M} \left\{ 1 - \frac{p(2, x_2, \ldots, x_n|\theta^{(t)})}{p(1, x_2, \ldots, x_n|\theta^{(t)})} \right\} \frac{p(1, x_2, \ldots, x_n|\theta^{(t)})}{p(1, x_2, \ldots, x_n|\theta^{(1)})} > 0 \tag{71} $$
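
The comparison (70) can also be carried out entirely on the log scale. The
sketch below uses the log-sum-exp device, which is a different stabilization
from the ratio rule in the text but decides the same inequality; it assumes
that the log joint probabilities $\log p(j, x_2, \ldots, x_n|\theta^{(t)})$
have already been computed for each posterior sample, and the function names
are chosen here for illustration.

```python
import numpy as np

def logsumexp(values):
    """Stable log(sum(exp(values))) obtained by factoring out the maximum."""
    values = np.asarray(values, dtype=float)
    m = np.max(values)
    return m + np.log(np.sum(np.exp(values - m)))

def classify(log_p1, log_p2):
    """Return 1 if sum_t p(1,...|theta_t) > sum_t p(2,...|theta_t), else 2.

    log_p1[t] and log_p2[t] hold the log joint probabilities under the
    t-th posterior sample (assumed precomputed)."""
    return 1 if logsumexp(log_p1) > logsumexp(log_p2) else 2
```

With 13 nodes and many factors per record the joint probabilities can be far
below the smallest representable float, in which case a direct evaluation of
(70) compares zero with zero while the log scale version still decides
correctly.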



4.3.2. Preliminary Results 

Table 2 shows the results of running the sampler with different parameter
values. The burn column contains the number of complete sweeps performed and
discarded before collecting samples. The other columns are: M the number of
thetas sampled, N the observed sample size, inter the number of discarded sweeps
between samples, Met the number of Metropolis step corrections for the root node
and for the children nodes, $\alpha$ the parameter of the entropic prior and
finally, % succ. is the percentage of correct classifications on 1000 random
tests.

    burn    M      N      inter   Met       alpha   % succ.
    100     100    100    50      [30 15]   10      82.7
    200     100    100    100     [5 2]     0.1     81.2
    1000    200    100    50      [2 2]     1.0     78.4
    1000    200    100    100     [1 1]     1.0     79.0
    100     100    50     50      [1 1]     1.0     76.3

TABLE 2. Summary of Simulations

Notice that the metropolis corrections seem to help but they slow down the 
sampler. Notice also the drop in performance when the sample size becomes 50. 

These results show the adequacy of the entropic sampler for the classification 
task. However, the naive bayes DAG is not competitive with DAGs containing 
more realistic structure for this problem. A simulated annealing search over the 
space of DAGs produces structures showing over 84% success rate in the more 
difficult task of classification with 4 (not just 2) categories of credit card. 



5. Entropic Prior for Mixtures of Gaussians 

The need for flexible, informative, proper priors for mixtures has been on the
statistician's wish list for a long time (e.g. see [14]). In this section we
derive, from first principles, the entropic prior for a finite mixture of
gaussians. This seems to be the first informative prior for mixtures derivable
from an objective principle. The straightforward application of (6) produces a
prior that on the one hand is remarkably close to the conjugate prior that has
been shown most successful in simulations, and on the other hand, departs from
it in a way that has always been thought to be desirable but for which there was
no known way to implement.



5.1. THE MODEL 

We consider a finite mixture of $k$ univariate gaussians with vector of
parameters $\theta = (\mu, \sigma, \omega)$ where $\mu \in \mathbb{R}^k$ is the
vector of $k$ means, $\sigma \in \mathbb{R}_+^k$ is the vector of $k$ standard
deviations and $\omega \in \Delta^{k-1}$ is the mixing probability vector in
the $(k-1)$-dimensional simplex $\Delta^{k-1}$. We use the standard missing
data model for mixtures, i.e., we assume the data $(x, z)$ has joint density,
for $x \in \mathbb{R}$ and $z \in \{1, 2, \ldots, k\}$, given by,

$$ f(x, z|\theta) = \omega_z\, N(x; \mu_z, \sigma_z) \tag{72} $$

where $N(x; a, b)$ denotes the density of the normal distribution with mean $a$
and standard deviation $b$. The label $z$ is assumed to be missing from the data
so that the marginal density of $x$ has the desired mixture form,

$$ f(x|\theta) = \sum_{i=1}^{k} \omega_i\, N(x; \mu_i, \sigma_i) \tag{73} $$

The trick is to compute the prior on the complete $(x, z)$ likelihood to
disentangle the expression for the entropy.

5.2. ENTROPY 

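Let $\theta^\circ = (m, s, \omega^\circ)$ be the initial guess for $\theta$. The
Kullback number between two distributions (72) with parameters $\theta$ and
$\theta^\circ$ is,

$$ I(\theta : \theta^\circ) = E_\theta \left[ \log \frac{f(x, z|\theta)}{f(x, z|\theta^\circ)} \right] \tag{74} $$

Computing the expectation by first conditioning on $z$ we obtain,

$$ I(\theta : \theta^\circ) = \sum_{j=1}^{k} \omega_j \left\{ I(N(\mu_j, \sigma_j^2) : N(m_j, s_j^2)) + \log \frac{\omega_j}{\omega_j^\circ} \right\} \tag{75} $$

where the Kullback number between two gaussians is
$I(N(\mu, \sigma^2) : N(m, s^2)) = \frac{(\mu - m)^2}{2 s^2} + \frac{\sigma^2}{2 s^2} - \log \frac{\sigma}{s} - \frac{1}{2}$.
Notice that since $\sum_{j=1}^{k} \omega_j = 1$ we can take the $1/2$ outside
the sum and it will get absorbed into the proportionality constant for the
entropic prior.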
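
The entropy (75) is cheap to evaluate. The following is a minimal sketch in
which the function names and argument order are choices made here; it uses the
closed form of the Kullback number between two univariate gaussians quoted
above, with the standard deviations (not the variances) as arguments.

```python
import numpy as np

def kl_gauss(mu, sigma, m, s):
    """I(N(mu, sigma^2) : N(m, s^2)) for univariate gaussians."""
    return ((mu - m) ** 2 / (2 * s ** 2) + sigma ** 2 / (2 * s ** 2)
            - np.log(sigma / s) - 0.5)

def mixture_entropy(mu, sigma, omega, m, s, omega0):
    """Equation (75) for the complete-data model (72)."""
    mu, sigma, omega, m, s, omega0 = map(
        np.asarray, (mu, sigma, omega, m, s, omega0))
    return float(np.sum(omega * (kl_gauss(mu, sigma, m, s)
                                 + np.log(omega / omega0))))

# Entropy of a two-component mixture relative to an initial guess.
value = mixture_entropy(mu=[0.0, 3.0], sigma=[1.0, 0.5], omega=[0.4, 0.6],
                        m=[0.5, 2.5], s=[1.0, 1.0], omega0=[0.5, 0.5])
```

The entropy vanishes when $\theta = \theta^\circ$ and is positive otherwise.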

5.3. VOLUME ELEMENT 



Using (24) we can immediately obtain from (75) the entries of the Fisher matrix.
The matrix is clearly block diagonal with gaussian blocks for the
$(\mu, \sigma)$ parameters and a multinomial block for the $\omega$ parameters.
From the standard volume elements for gaussians and multinomials we can write
the full volume element as,

$$ g^{1/2}(\theta)\, d\theta \propto \frac{d\mu\, d\sigma\, d\omega}{\left( \prod_{j=1}^{k} \sigma_j^2 \right) \left( \prod_{j=1}^{k} \omega_j^{-1/2} \right)} \tag{76} $$

where we are abusing the notation a bit since $d\omega$ must be understood as
$\prod_{j=1}^{k-1} d\omega_j$ so that $\omega \in \Delta^{k-1}$.



5.4. ENTROPIC PRIOR 

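Just multiply $e^{-\alpha I(\theta : \theta^\circ)}$ with (76) to get,

$$ \pi(\theta|\alpha,\theta^\circ) \propto \prod_{j=1}^{k} \exp\left\{ -\alpha\omega_j \frac{(\mu_j - m_j)^2}{2 s_j^2} \right\} \sigma_j^{\,\alpha\omega_j - 2} \exp\left\{ -\alpha\omega_j \frac{\sigma_j^2}{2 s_j^2} \right\} \omega_j^{1/2} \left( \frac{\omega_j^\circ}{s_j\, \omega_j} \right)^{\alpha\omega_j} \tag{77} $$

This is a remarkable result. Equation (77) says that conditional on $\omega$
all the components of $\mu$ and $\sigma$ are independent of each other.
Moreover,

$$ \mu_j|\omega \sim N\!\left( m_j, \frac{s_j^2}{\alpha\omega_j} \right) \tag{78} $$

$$ \sigma_j^2|\omega \sim \mathrm{Gamma}\!\left( \frac{\alpha\omega_j - 1}{2}, \frac{\alpha\omega_j}{2 s_j^2} \right) \tag{79} $$

where to obtain (79) we have used the change of variables $v = \sigma_j^2$ that
produces the jacobian $v^{-1/2}$. The joint marginal density of $\omega$ is
obtained by integrating (77) over the $\mu$ and $\sigma$ coordinates obtaining,
up to a proportionality constant, that,

$$ \pi(\omega|\alpha,\theta^\circ) \propto \prod_{j=1}^{k} \Gamma\!\left( \frac{\alpha\omega_j - 1}{2} \right) (\omega_j^\circ)^{\alpha\omega_j}\, \omega_j^{\,(-3\alpha\omega_j + 1)/2} \tag{80} $$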



5.5. POSTERIOR 

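Let $x^n = (x_1, \ldots, x_n)$ be the observed data and let $z^n$ be the
missing labels. As usual we shake the Bayesian wand to obtain,

$$ \pi(\theta, z^n|x^n, \alpha, \theta^\circ) \propto f(x^n|\theta, z^n)\, f(z^n|\theta)\, \pi(\theta|\alpha, \theta^\circ) \propto \left( \prod_{i=1}^{n} \frac{1}{\sigma_{z_i}} \exp\left\{ -\frac{(x_i - \mu_{z_i})^2}{2 \sigma_{z_i}^2} \right\} \right) \left( \prod_{i=1}^{n} \omega_{z_i} \right) \pi(\theta|\alpha, \theta^\circ) \tag{81} $$

For $j = 1, \ldots, k$ define $k_j \in \{0, 1, \ldots, n\}$ by,

$$ k_j = |\{t : z_t = j\}| \tag{82} $$

and replacing these counts into (81) we have,

$$ \pi(\theta, z^n|x^n, \alpha, \theta^\circ) \propto \prod_{j=1}^{k} \left\{ \left( \frac{\omega_j}{\sigma_j} \right)^{k_j} \exp\left[ -\frac{1}{2\sigma_j^2} \sum_{i : z_i = j} (x_i - \mu_j)^2 \right] \right\} \pi(\theta|\alpha, \theta^\circ) \tag{83} $$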
5.6. GIBBS SAMPLER

Inference is done by sampling $(\theta, z^n)$ vectors from the posterior (83).
To sample from (83) we use Gibbs sampling, i.e. we cycle over the full
conditionals for each of the parameters. Let us use the notation $| \ldots$ to
mean given all the other parameters and the data. Here are the distributions for
each of the terms:

5.6.1. Conditional for $z^n$

When the vector of mixing probabilities $\omega$ is given, the components of
$z^n$ are independent multinomials with $\omega$ as the parameter and
independent of everything else. Thus, for $i = 1, 2, \ldots, n$

$$ z_i | \ldots \sim \mathrm{Multi}(\omega_1, \omega_2, \ldots, \omega_k) \tag{84} $$



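5.6.2. Conditional for $\mu$

Here again we have the classic problem of computing the posterior distribution
for the mean of a gaussian given $k_j$ independent gaussian observations when
the prior is the conjugate gaussian. Looking at the first term of (77) and the
right hand side of (83) we get,

$$ \mu_j | \ldots \sim N(a_j, b_j^2) \tag{85} $$

where,

$$ a_j = b_j^2 \left( \frac{1}{\sigma_j^2} \sum_{i : z_i = j} x_i + \frac{\alpha\omega_j}{s_j^2}\, m_j \right) \tag{86} $$

and

$$ \frac{1}{b_j^2} = \frac{k_j}{\sigma_j^2} + \frac{\alpha\omega_j}{s_j^2} \tag{87} $$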
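
The update for $\mu_j$ is the usual precision weighted average of data and
prior guess. As a rough sketch (the function names and argument order are
assumptions made here, not part of the original description), one step of the
Gibbs sampler for a single mean can be written as:

```python
import numpy as np

def mu_conditional(x, z, j, sigma_j, omega_j, m_j, s_j, alpha):
    """Mean a_j and standard deviation b_j of the full conditional (85)."""
    x_j = np.asarray(x, dtype=float)[np.asarray(z) == j]  # data labelled j
    k_j = x_j.size
    precision = k_j / sigma_j ** 2 + alpha * omega_j / s_j ** 2   # (87)
    mean = (x_j.sum() / sigma_j ** 2
            + alpha * omega_j * m_j / s_j ** 2) / precision       # (86)
    return mean, 1.0 / np.sqrt(precision)

def draw_mu(rng, x, z, j, sigma_j, omega_j, m_j, s_j, alpha):
    """One Gibbs draw of mu_j given everything else."""
    a_j, b_j = mu_conditional(x, z, j, sigma_j, omega_j, m_j, s_j, alpha)
    return rng.normal(a_j, b_j)
```

When no observation carries the label $j$ the conditional falls back to the
prior (78), with mean $m_j$ and variance $s_j^2/(\alpha\omega_j)$.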



5.6.3. Conditional for a 

Collecting all the factors with Oj from (^3|) and the second term from ( |77| ) we 
obtain, 



a,|...->(a|)*("^J-'=^) 



Now let V = a-j, then 



f,{v) ^ f„^{^)iv-^/^ (89) 



Using (H) with (H) we get, 

v = a]\..."^v~"''^exp\—-bv'> (90) 



V 
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where, 



b = 



(91) 
(92) 

(93) 



We can obtain a useful alternative to (90) by setting $u = 1/v$ so that

$$ f_u(u) = f_v(u^{-1})\, u^{-2} $$

and we get,

$$ \sigma_j^{-2} \mid \ldots \propto u^{a-1} \exp\left\{ -\frac{b}{u} - cu \right\} \tag{94} $$

where $a$, $b$ and $c$ are given by (91), (92), and (93) as before.

The distributions (90) and (94) are instances of the so called Generalized Inverse
Gaussian (or GIG for short, see |p^ ) distribution. The GIG distribution was
first introduced in relation to hyperbolic distributions in [16]. It can be shown
that,



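$$ \int_0^\infty u^{a-1} \exp\left\{ -\frac{b}{u} - cu \right\} du = 2 \left( \frac{b}{c} \right)^{a/2} \mathrm{BesselK}(a, 2\sqrt{bc}) \tag{95} $$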



where $\mathrm{BesselK}(a, x)$ is the modified Bessel function of the third kind. It is a
solution to the differential equation,

$$ x^2 y'' + x y' - (x^2 + a^2) y = 0 \tag{96} $$
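Identity (95) can be checked numerically without special-function libraries. The sketch below is an illustration and not part of the paper: it evaluates BesselK through the standard integral representation $K_a(x) = \int_0^\infty e^{-x \cosh t} \cosh(at)\, dt$, evaluates the left hand side of (95) after the substitution $u = e^s$, and compares the latter with the closed form $2 (b/c)^{a/2}\, \mathrm{BesselK}(a, 2\sqrt{bc})$. The plain trapezoid rule, the truncation limits and the step counts are assumptions suited to moderate values of $a$, $b$ and $c$.

```python
import math


def trapezoid(f, lo, hi, n):
    """Composite trapezoid rule with n equal steps on [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h


def bessel_k(a, x):
    """K_a(x) as the integral over t >= 0 of exp(-x cosh t) cosh(a t), truncated at t = 12."""
    return trapezoid(lambda t: math.exp(-x * math.cosh(t)) * math.cosh(a * t), 0.0, 12.0, 4000)


def gig_integral(a, b, c):
    """Left hand side of (95) after substituting u = exp(s), truncated to |s| <= 12."""
    return trapezoid(lambda s: math.exp(a * s - b * math.exp(-s) - c * math.exp(s)),
                     -12.0, 12.0, 8000)


def gig_normalizer(a, b, c):
    """Closed form 2 (b/c)^(a/2) K_a(2 sqrt(bc))."""
    return 2.0 * (b / c) ** (a / 2.0) * bessel_k(a, 2.0 * math.sqrt(b * c))


lhs = gig_integral(1.5, 0.7, 2.0)
rhs = gig_normalizer(1.5, 0.7, 2.0)
print(abs(lhs - rhs) / rhs)  # relative discrepancy between the two sides
```

For half-integer orders the Bessel function has an elementary form, for instance $K_{1/2}(x) = \sqrt{\pi / (2x)}\, e^{-x}$, which gives an independent check on `bessel_k`.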



Thus, (90) and (94) are proper provided that $b > 0$ and $c > 0$. When either $b = 0$
or $c = 0$ (but not both) one of the two becomes a Gamma. The good news about
GIGs is that they are log concave and there are universal
algorithms for generating them. The problem is that the standard off-the-shelf
algorithm for log concave densities requires the evaluation of the normalization
constant, which in this case is too expensive, since it involves evaluating BesselK.
The following Gamma approximation provides a solution to this problem.



5.6.4. Gamma Approximation to GIG 

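By computer algebra it is possible to find the parameters of a Gamma that best
fit a given GIG. Let us use the notation, for $\alpha > 0$ and $\beta > 0$,

$$ \Gamma(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x} \quad \text{for } x > 0 \tag{97} $$

and let, for $a > 0$, $b > 0$ and $c > 0$,

$$ G(x; a, b, c) = \frac{1}{Z}\, x^{a-1} \exp\left\{ -\frac{b}{x} - cx \right\} \quad \text{for } x > 0 \tag{98} $$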

where $Z$ is the normalization constant given by the right hand side of (95). We
summarize the findings in the next theorem.

Theorem 2 The best second order $\Gamma(x; \alpha^*, \beta^*)$ approximation to $G(x; a, b, c)$ is
when,

$$ \alpha^* = a + \frac{4bc}{\lambda} \tag{99} $$

$$ \beta^* = c + \frac{4bc^2}{4bc + 2\rho} \tag{100} $$

where,

$$ \lambda = a - 1 + E \tag{101} $$

$$ \rho = (a - 1)\lambda \tag{102} $$

$$ E = \sqrt{(a-1)^2 + 4bc} \tag{103} $$



Proof Here is a summary of what was found with MAPLE. The function $G(x; a, b, c)$
has a single global maximum at

$$ x^* = \frac{\lambda}{2c} = \frac{a - 1 + \sqrt{(a-1)^2 + 4bc}}{2c} \tag{104} $$
Expanding both log likelihoods in Taylor series about $x^*$ we get,

$$ \log \Gamma(x; \alpha, \beta) = A_0 + A_1 (x - x^*) + A_2 (x - x^*)^2 + o((x - x^*)^2) \tag{105} $$

$$ \log G(x; a, b, c) = B_0 + 0 \cdot (x - x^*) + B_2 (x - x^*)^2 + o((x - x^*)^2) \tag{106} $$
The optimal parameters $\alpha^*$ and $\beta^*$ are the solution to the system of equations

$$ A_1(\alpha, \beta) = 0 \tag{107} $$

$$ A_2(\alpha, \beta) = B_2(a, b, c) \tag{108} $$



Q.E.D. 
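The matching conditions (107) and (108) can also be solved by hand: the Gamma log density must have zero slope at $x^*$ and the same second derivative as $\log G$ there. A small Python sketch under that reading follows. The function names are illustrative, and the closed forms in the comments come from differentiating the two log densities directly rather than from the MAPLE output.

```python
import math


def gig_mode(a, b, c):
    """Positive root of c x^2 - (a - 1) x - b = 0, i.e. where d/dx log G vanishes."""
    return (a - 1.0 + math.sqrt((a - 1.0) ** 2 + 4.0 * b * c)) / (2.0 * c)


def gamma_approx(a, b, c):
    """Shape and rate of the Gamma matching the mode and curvature of G(x; a, b, c)."""
    x = gig_mode(a, b, c)
    alpha = a + 2.0 * b / x    # equal second derivatives of the log densities at x
    beta = (alpha - 1.0) / x   # zero first derivative of the Gamma log density at x
    return alpha, beta


a, b, c = 2.0, 1.0, 1.0
x = gig_mode(a, b, c)
alpha, beta = gamma_approx(a, b, c)
# Slopes of log G and log Gamma at x (both should vanish up to rounding).
print((a - 1) / x + b / x**2 - c, (alpha - 1) / x - beta)
# Second derivatives of log G and log Gamma at x (should agree up to rounding).
print(-(a - 1) / x**2 - 2 * b / x**3, -(alpha - 1) / x**2)
```

Since $\alpha^* - 1 = a - 1 + 2b/x^* > 0$ whenever $b > 0$, the fitted Gamma always has an interior mode, located at the mode of the GIG.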



The Gamma approximation provided by Theorem 2 fits the bulk of the GIG
very well but the tails of the GIG are always heavier. A few Metropolis iterations
starting from the gamma approximation should be used to correct for the light
tails.
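One way to read this recommendation is as an independence Metropolis-Hastings sampler that proposes from the fitted Gamma and targets the unnormalized GIG, so that BesselK is never evaluated. The paper does not spell out the Metropolis variant, so the independence proposal, the default of five iterations, and all function names below are assumptions made for illustration. Because the Gamma tails are lighter than the GIG tails, the chain can linger at large values, so this is best treated as a short correction rather than an exact sampler.

```python
import math
import random


def gamma_approx(a, b, c):
    """Gamma shape and rate matching the mode and curvature of x^(a-1) exp(-b/x - c x)."""
    mode = (a - 1.0 + math.sqrt((a - 1.0) ** 2 + 4.0 * b * c)) / (2.0 * c)
    alpha = a + 2.0 * b / mode
    return alpha, (alpha - 1.0) / mode


def sample_gig(a, b, c, steps=5, rng=random):
    """Draw from the Gamma approximation, then apply `steps` independence MH corrections."""
    alpha, beta = gamma_approx(a, b, c)

    def log_weight(x):
        # log of (unnormalized GIG density) / (unnormalized Gamma proposal density)
        return (a - alpha) * math.log(x) - b / x - (c - beta) * x

    # random.gammavariate takes (shape, scale), so the scale is 1 / rate.
    x = rng.gammavariate(alpha, 1.0 / beta)
    for _ in range(steps):
        y = rng.gammavariate(alpha, 1.0 / beta)
        log_u = math.log(1.0 - rng.random())  # 1 - U lies in (0, 1], so the log is finite
        if log_u < log_weight(y) - log_weight(x):
            x = y
    return x


rng = random.Random(0)
draws = [sample_gig(10.5, 1.0, 1.0, rng=rng) for _ in range(2000)]
print(sum(draws) / len(draws))  # sample mean of the corrected draws
```

The acceptance ratio only involves the ratio of the two unnormalized densities, which is why neither $Z$ nor $\Gamma(\alpha^*)$ appears in the code.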



5.6.5. Conditional for $\omega$

Collecting all the factors with $\omega_j$ from (83) and all the terms from (77) we obtain,



where, 
and, 



aj = kj — aujj + 1/2 w kj + 1/2 



2 / X 2 

^) -logf^ 



2 log A 



(109) 

(110) 
(111) 
Notice that $\beta_j > 0$ and we can use Lemma 1 again to find good starting approximations
to be corrected with a small number of Metropolis iterations.

6. Conclusions and Future Work 

We have provided explicit formulas for adding objective prior information in two
general classes of hypothesis spaces: discrete probabilistic networks and mixtures
of gaussians models. Many highly successful models are special cases of BBNs. A
partial list includes linkage analysis in genetics, Hidden Markov
Models for speech recognition, Kalman filtering for tracking missiles, and density
estimation for data compression and coding with turbocodes. It is only natural
to expect improvements in the performance of these methods if cogent prior
information is available that has not been used. This is especially true in high
dimensional parametric models.

I am currently investigating alternative/complementary methods to MCMC
for performing approximate inference with entropic priors. These include the
variational bayes approach (see [18]), and the Expectation Propagation (EP) method
of Minka (see